Counterfactual Invariance to Spurious Correlations in Text Classification
Informally, a 'spurious correlation' is the dependence of a model on some aspect of the input data that an analyst thinks shouldn't matter. In machine learning, these have a know-it-when-you-see-it character; e.g., changing the gender of a sentence's subject changes a sentiment predictor's output. To check for spurious correlations, we can 'stress test' models by perturbing irrelevant parts of input data and seeing if model predictions change. In this paper, we study stress testing using the tools of causal inference. We introduce counterfactual invariance as a formalization of the requirement that changing irrelevant parts of the input shouldn't change model predictions.
A Proofs
This is essentially by definition: an intervention on Z does not change the potential outcomes, so it does not change the value of f(X).
Suppose f is a counterfactually invariant predictor, and let L be either squared-error or cross-entropy loss. Suppose that the target distribution Q is causally compatible with the training distribution P, and that either of the following conditions holds:
1. the data obeys the anti-causal graph, or
2. the data obeys the causal-direction graph, there is no confounding (but possibly selection), and the association is purely spurious, Y ⊥ X | X^⊥_Z.
We begin with the anti-causal case.
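The stress-testing idea the paper formalizes can be sketched in a few lines of code. This is a minimal illustration, not the paper's method: the gender-swap map and the `toy_sentiment` predictor are hypothetical stand-ins for a real counterfactual transform and a real model.

```python
# Counterfactual-invariance "stress test" sketch: perturb an irrelevant
# attribute of the input and check whether the prediction changes.

GENDER_SWAP = {"he": "she", "she": "he", "him": "her", "her": "him",
               "his": "her", "actor": "actress", "actress": "actor"}

def gender_swap(text: str) -> str:
    """A crude counterfactual transform: swap gendered tokens."""
    return " ".join(GENDER_SWAP.get(tok, tok) for tok in text.split())

def toy_sentiment(text: str) -> int:
    """Hypothetical classifier: 1 if more positive than negative words."""
    pos = {"great", "good", "excellent"}
    neg = {"bad", "awful", "boring"}
    toks = text.split()
    score = sum(t in pos for t in toks) - sum(t in neg for t in toks)
    return 1 if score > 0 else 0

def stress_test(predict, texts, transform) -> float:
    """Fraction of inputs whose prediction flips under the transform.
    0.0 means the predictor passes this particular stress test."""
    flips = sum(predict(t) != predict(transform(t)) for t in texts)
    return flips / len(texts)

texts = ["he thought the film was great", "she found it boring and bad"]
flip_rate = stress_test(toy_sentiment, texts, gender_swap)  # 0.0 here
```

Because the toy classifier never looks at gendered tokens, its flip rate is zero; a model that did key on them would show a positive flip rate, which is exactly the failure the stress test is meant to surface.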
Counterfactual Invariance to Spurious Correlations: Why and How to Pass Stress Tests
Veitch, Victor, D'Amour, Alexander, Yadlowsky, Steve, Eisenstein, Jacob
AI Alignment in Medical Imaging: Unveiling Hidden Biases Through Counterfactual Analysis
Ma, Haroui, Quinzan, Francesco, Willem, Theresa, Bauer, Stefan
Machine learning (ML) systems for medical imaging have demonstrated remarkable diagnostic capabilities, but their susceptibility to biases poses significant risks, since biases may negatively impact generalization performance. In this paper, we introduce a novel statistical framework to evaluate the dependency of medical imaging ML models on sensitive attributes, such as demographics. Our method leverages the concept of counterfactual invariance, measuring the extent to which a model's predictions remain unchanged under hypothetical changes to sensitive attributes. We present a practical algorithm that combines conditional latent diffusion models with statistical hypothesis testing to identify and quantify such biases without requiring direct access to counterfactual data. Through experiments on synthetic datasets and large-scale real-world medical imaging datasets, including CheXpert and MIMIC-CXR, we demonstrate that our approach aligns closely with counterfactual fairness principles and outperforms standard baselines. This work provides a robust tool to ensure that ML diagnostic systems generalize well, e.g., across demographic groups, offering a critical step towards AI safety in healthcare. Code: https://github.com/Neferpitou3871/AI-Alignment-Medical-Imaging.
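The hypothesis-testing step can be sketched independently of the generative model. Below is a minimal paired sign-flip permutation test under the assumption that counterfactual predictions are already available (the paper obtains them via a conditional latent diffusion model; here `preds` and `preds_cf` are just arrays of model outputs on original and counterfactual inputs).

```python
import numpy as np

def paired_permutation_test(preds, preds_cf, n_perm=10_000, seed=0):
    """p-value for H0: the model is counterfactually invariant (zero mean
    paired difference), via random sign flips of the paired differences."""
    rng = np.random.default_rng(seed)
    d = np.asarray(preds, dtype=float) - np.asarray(preds_cf, dtype=float)
    observed = abs(d.mean())
    # Under H0 each paired difference is symmetric around 0, so its sign
    # can be flipped at random to simulate the null distribution.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = np.abs((signs * d).mean(axis=1))
    return (1 + int((null >= observed).sum())) / (n_perm + 1)

rng = np.random.default_rng(1)
preds = rng.normal(size=200)
p_inv = paired_permutation_test(preds, preds)        # exactly invariant: p = 1.0
p_dep = paired_permutation_test(preds, preds + 0.5)  # constant shift: tiny p
```

A large p-value is consistent with counterfactual invariance; a tiny one, as in the shifted case, flags a systematic dependence of predictions on the sensitive attribute.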
Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment
Wang, Chaoqi, Zhao, Zhuokai, Jiang, Yibo, Chen, Zhaorun, Zhu, Chen, Chen, Yuxin, Liu, Jiayi, Zhang, Lizhu, Fan, Xiangjun, Ma, Hao, Wang, Sinong
Recent advancements in large language models (LLMs) have demonstrated remarkable capabilities in generating coherent, contextually appropriate responses across a wide range of tasks (Brown et al., 2020). A key approach to further refine these models is Reinforcement Learning from Human Feedback (RLHF), which leverages human evaluations to guide the training process and align model outputs more closely with human preferences (Stiennon et al., 2020; Ouyang et al., 2022; Bai et al., 2022; Wang et al., 2024). RLHF typically involves training a reward model to capture human preferences, which is then used to fine-tune LLMs via reinforcement learning (RL) (Schulman et al., 2017; Chen et al., 2024b,f).

Despite the success of RLHF, reward modeling is inherently prone to spurious correlations: associations in the training data that do not reflect true causal relationships (Veitch et al., 2021). These can lead to unintended biases and induce reward hacking (McMilin, 2022). Reward hacking occurs when RL agents exploit flaws or ambiguities in the reward function to maximize rewards without genuinely improving alignment with desired behaviors or completing designed tasks (Amodei et al., 2016; Weng, 2024).

Consequently, this leads to misaligned models that exhibit biases such as favoring longer outputs (length bias) (Zheng et al., 2023), agreeing with users' incorrect assertions (sycophancy bias) (Perez et al., 2022), developing unintended shortcuts when making predictions (concept bias) (Zhou et al., 2023), and implicitly developing discrimination against certain demographic groups (discrimination bias) (Tamkin et al., 2023; Chen et al., 2024c). These biases, rooted in spurious correlations and reward hacking rather than true causal relationships, undermine the reliability and trustworthiness of LLMs, posing significant challenges for their safe and responsible deployment in real-world applications (Anwar et al., 2024; Qi et al., 2024).
To understand and mitigate these issues, it is essential to consider the sources of error in reward modeling.
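A crude diagnostic in the spirit of this discussion (not from the paper): if a learned reward correlates strongly with an irrelevant surface feature such as response length, that is a hint of a spurious correlation that RL could exploit. The deliberately length-biased `toy_reward` below is a hypothetical stand-in for a real reward model.

```python
# Check how strongly a reward model's scores track response length.

def toy_reward(response: str) -> float:
    """Hypothetical reward model that leaks length into its score."""
    return 0.1 * len(response.split()) + (1.0 if "thanks" in response else 0.0)

def pearson_r(xs, ys) -> float:
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

responses = ["ok", "sure thing", "that is a fine idea indeed",
             "here is a very long and padded answer with many words"]
lengths = [len(r.split()) for r in responses]
rewards = [toy_reward(r) for r in responses]
r = pearson_r(lengths, rewards)  # near 1.0 for this length-biased toy model
```

A correlation this high between reward and length, across responses of similar quality, is exactly the kind of non-causal association that produces length bias in the fine-tuned policy.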
Out-Of-Context Prompting Boosts Fairness and Robustness in Large Language Model Predictions
Cotta, Leonardo, Maddison, Chris J.
Frontier Large Language Models (LLMs) are increasingly being deployed for high-stakes decision-making. On the other hand, these models are still consistently making predictions that contradict users' or society's expectations, e.g., hallucinating, or discriminating. Thus, it is important that we develop test-time strategies to improve their trustworthiness. Inspired by prior work, we leverage causality as a tool to formally encode two aspects of trustworthiness in LLMs: fairness and robustness. Under this perspective, existing test-time solutions explicitly instructing the model to be fair or robust implicitly depend on the LLM's causal reasoning capabilities. In this work, we explore the opposite approach. Instead of explicitly asking the LLM for trustworthiness, we design prompts to encode the underlying causal inference algorithm that will, by construction, result in more trustworthy predictions. Concretely, we propose out-of-context prompting as a test-time solution to encourage fairness and robustness in LLMs. Out-of-context prompting leverages the user's prior knowledge of the task's causal model to apply (random) counterfactual transformations and improve the model's trustworthiness. Empirically, we show that out-of-context prompting consistently improves the fairness and robustness of frontier LLMs across five different benchmark datasets without requiring additional data, finetuning or pre-training.
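The gist of this test-time strategy can be sketched with a stub in place of a real LLM call. Everything here is illustrative rather than the paper's exact procedure: the name list, the `biased_predict` stub, and the majority vote are all assumptions.

```python
# Out-of-context prompting sketch: replace the sensitive attribute with
# random out-of-context values before predicting, then aggregate.
import random

NAMES = ["Alex", "Sam", "Jordan", "Taylor"]

def biased_predict(text: str) -> int:
    """Stub for an LLM that (undesirably) keys on the name 'Bob'."""
    return 1 if "Bob" in text else 0

def out_of_context_predict(predict, text: str, name: str, k=5, seed=0) -> int:
    """Swap the sensitive token for k random counterfactual values and
    majority-vote the resulting predictions."""
    rng = random.Random(seed)
    votes = [predict(text.replace(name, rng.choice(NAMES))) for _ in range(k)]
    return int(sum(votes) > k / 2)

prompt = "Bob applied for the loan with a stable income"
direct = biased_predict(prompt)                              # 1: depends on the name
ooc = out_of_context_predict(biased_predict, prompt, "Bob")  # 0: name randomized away
```

By construction, the aggregated prediction cannot depend on the original name, so the stub's name bias is removed without asking the model to "be fair".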
Learning Counterfactually Invariant Predictors
Quinzan, Francesco, Casolo, Cecilia, Muandet, Krikamol, Luo, Yucen, Kilbertus, Niki
Invariance, or equivariance to certain data transformations, has proven essential in numerous applications of machine learning (ML), since it can lead to better generalization capabilities [Arjovsky et al., 2019, Bloem-Reddy and Teh, 2020, Chen et al., 2020]. For instance, in image recognition, predictions ought to remain unchanged under scaling, translation, or rotation of the input image. Data augmentation, an early heuristic to promote such invariances, has become indispensable for successfully training deep neural networks (DNNs) [Shorten and Khoshgoftaar, 2019, Xie et al., 2020]. Well-known examples of "invariance by design" include convolutional neural networks (CNNs) for translation invariance [Krizhevsky et al., 2012], group-equivariant NNs for general group transformations [Cohen and Welling, 2016], recurrent neural networks (RNNs) and transformers for sequential data [Vaswani et al., 2017], DeepSets [Zaheer et al., 2017] for sets, and graph neural networks (GNNs) for different types of geometric structures [Battaglia et al., 2018]. Many applications in modern ML, however, call for arguably stronger notions of invariance based on causality. This case has been made for image classification, algorithmic fairness [Hardt et al., 2016, Mitchell et al., 2021], robustness [Bühlmann, 2020], and out-of-distribution generalization [Lu et al., 2021]. The goal is invariance with respect to hypothetical manipulations of the data generating process (DGP). Various works develop methods that assume observational distributions (across environments or between training and test) to be governed by shared causal mechanisms, but differ due to various types of distribution shifts encoded by the causal model [Arjovsky et al., 2019, Bühlmann, 2020, Heinze-Deml et al., 2018, Makar et al., 2022].
Part of this work was done while Francesco Quinzan visited the Max Planck Institute for Intelligent Systems, Tübingen, Germany.
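One generic way to promote such invariance during training, sketched below, is to penalize the prediction gap across counterfactual pairs. This is an illustrative regularizer under the assumption that counterfactual pairs are available, not the kernel-based (HSCIC) operator this paper develops.

```python
import numpy as np

def invariance_penalty(f, X, X_cf) -> float:
    """Mean squared prediction gap across counterfactual pairs."""
    return float(np.mean((f(X) - f(X_cf)) ** 2))

def total_loss(f, X, y, X_cf, lam=1.0) -> float:
    """Task loss plus a counterfactual-invariance penalty."""
    mse = float(np.mean((f(X) - y) ** 2))
    return mse + lam * invariance_penalty(f, X, X_cf)

# Toy data: column 0 is a stable feature; column 1 is a spurious one that
# the counterfactual operation resamples.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X_cf = X.copy()
X_cf[:, 1] = rng.normal(size=100)

invariant_f = lambda A: A[:, 0]   # ignores the spurious coordinate
spurious_f = lambda A: A[:, 1]    # keys on the spurious coordinate

gap_inv = invariance_penalty(invariant_f, X, X_cf)    # 0.0
gap_spur = invariance_penalty(spurious_f, X, X_cf)    # > 0
```

The penalty is zero exactly for predictors that ignore the manipulated coordinate, so minimizing the combined loss pushes the model toward counterfactually invariant solutions.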